Description:

This tutorial contains data processing, visualization and differental expression analysis of single-end RNA-seq data. The RNA-seq count data comes from plant seedling tissues of maize natural mutant Ufo1-1 and wild-type (Wt) control.You can find additional information about Ufo1-1 from this paper. I created my own transcriptome assembly using StringTie https://ccb.jhu.edu/software/stringtie/, hense the “MSTRG” gene_id in the count data. The reason I did that was because I not only want to study genes found in the reference genome, but also characterize new transcripts. I did pseudo-alignment and quantify transcript counts using Kallisto https://pachterlab.github.io/kallisto/manual. Gene counts are the sum of trancript pseudocounts. These counts are not in integers, so we need to round them first. Alternatively, you scan also generate raw count data using HTSeq https://htseq.readthedocs.io/en/release_0.10.0/ or featurecounts http://bioinf.wehi.edu.au/featureCounts/. However, please do not use bedtools as it usually counts ambiguous alignments.

Some additional reading on statistical analysis of RNA-seq:

http://www.nathalievialaneix.eu/doc/pdf/tutorial-rnaseq.pdf

# pacman is a good package for loading packages
library(pacman)
pacman::p_load(edgeR, RColorBrewer, gplots, ggplot2, reshape2, DT, cowplot,
               limma, DESeq, DESeq2, data.table, e1071, ComplexHeatmap, VennDiagram)

Source R script

I have writen several functions in an external R script, we can use those function by sourcing the function

source("https://raw.githubusercontent.com/CJinnny/RNA-seq-tutorial/master/RNA_seq_tutorial_functions.R")

Data preprocessing

In order to know gene functions, we can match MSTRG id into maize reference genome id (v4), which is already well characterized. Additionally, I also include maize reference genome V3 id.

require(data.table)
data <- fread("https://raw.githubusercontent.com/CJinnny/RNA-seq-tutorial/master/genes.txt", sep = "\t")
data[,6:15] = round(data[,6:15])

id_conv <- fread("https://raw.githubusercontent.com/CJinnny/RNA-seq-tutorial/master/https://raw.githubusercontent.com/CJinnny/RNA-seq-tutorial/master/all_MSTRG_v4_v3.csv", header = TRUE, sep = ",")
id_conv <- data.frame(id_conv, row.names = 1)
functions <- fread("https://raw.githubusercontent.com/CJinnny/RNA-seq-tutorial/master/B73v4_gene_function.txt", header = FALSE, sep = "\t")
functions <- subset(functions, select = c("V1", "V2"))
colnames(id_conv) <- c("v4_gene_id", "v3_gene_id", "MSTRG_gene_id")
colnames(functions) <- c("v4_gene_id", "gene_function")
annotations = merge(id_conv, functions, by = "v4_gene_id", all.x = TRUE)
data <- merge(data, annotations, by = "MSTRG_gene_id", all.x = TRUE)
data <- aggregate(data, 
                  by = list(data$MSTRG_gene_id),
                  FUN = aggregate_func
                  )
data <- na.omit(data, cols="v4_gene_id")
data <- data.frame(data, row.names = 1)
setnames(data, old = c("UU1", "UU2", "UU3", "UU4", "UU5", "WW1", "WW2", "WW3", "WW4", "WW5"), 
         new = c("U1", "U2","U3","U4","U5", "W1", "W2","W3","W4","W5"))


datatable(data, rownames = FALSE, width=0.5)

Normality testing with skewness & kurtosis

counts <- data[,6:15]

require(e1071)
data_diagnosis(counts)
## skewness is:
##  36.63689 41.82896 21.72463 34.99034 22.78793 42.0665 33.39914 37.64356 27.01962 54.97658 
## kurtosis is:
##  2199.034 3028.689 865.9139 1964.686 867.6066 2758.378 1949.117 2319.712 1375.454 4894.567

Data transformation

Data is skewed (high skewness and high kurtosis).

We need to do data transformations for PCA plot, otherwise outliers will have a great impact on the clustering.

log transformation: logcounts = log2(counts + 1).

DESeq2::rlog(): “regularized log” transformation. For more information see https://rdrr.io/bioc/DESeq2/man/rlog.html

edgeR::cpm(): “counts per million” transformation. For more information see https://rdrr.io/bioc/edgeR/man/cpm.html

DESeq2:varianceStabilizingTransformation(): “variance stabilizing transformation”. For more information see https://rdrr.io/bioc/DESeq2/man/varianceStabilizingTransformation.html

require(DESeq2)
require(edgeR)
logcounts = log2(counts + 1)
rlogcounts = rlog(as.matrix(counts))
rownames(rlogcounts) = rownames(logcounts)
cpmcounts = cpm(as.matrix(counts), prior.count = 2, log = TRUE)
vstcounts = varianceStabilizingTransformation(as.matrix(counts))
data_diagnosis(logcounts)
## skewness is:
##  -0.5997057 -0.6048998 -0.6093726 -0.4912253 -0.6109366 -0.5945748 -0.5262748 -0.5730369 -0.5900014 -0.552007 
## kurtosis is:
##  -0.7826695 -0.7977504 -0.8067969 -0.9002022 -0.7725521 -0.7831297 -0.9279888 -0.8513029 -0.8382193 -0.8962603

data_diagnosis(rlogcounts)
## skewness is:
##  -0.6679657 -0.6836715 -0.6832402 -0.6763476 -0.6779365 -0.6773912 -0.6746331 -0.6697096 -0.6842158 -0.671056 
## kurtosis is:
##  -0.6403723 -0.6370364 -0.6450905 -0.6332838 -0.633821 -0.6331829 -0.6479403 -0.6440739 -0.6427632 -0.6518892

data_diagnosis(cpmcounts)
## skewness is:
##  -0.4285089 -0.4730039 -0.4773158 -0.4526601 -0.4639736 -0.4572871 -0.4528065 -0.4374913 -0.4839089 -0.438765 
## kurtosis is:
##  -0.9539685 -0.9381164 -0.9576858 -0.9321906 -0.9269013 -0.9280453 -0.9964032 -0.9797621 -0.9533996 -1.003599

data_diagnosis(vstcounts)
## skewness is:
##  0.3270751 0.2803435 0.27878 0.3042513 0.3098411 0.308709 0.2888468 0.312236 0.2699977 0.2917919 
## kurtosis is:
##  -0.5191474 -0.5951802 -0.6988962 -0.5030264 -0.550126 -0.5454259 -0.6343581 -0.5475308 -0.6730192 -0.6279794

Draw PCA (Principle Component Analysis) plot

Data normalization helps with separation

require(graphics)
require(RColorBrewer)
par(mfrow=c(2,3), mar=c(5.1, 4.6, 4.1, 1.6))
draw_PCA(counts, title = "PCA on raw data")
draw_PCA(logcounts, title = "PCA on log transformed data")
draw_PCA(rlogcounts, title = "PCA on rlog transformed data")
draw_PCA(cpmcounts, title = "PCA on cpm transformed data")
draw_PCA(vstcounts, title = "PCA on vst transformed data")

Draw MDS (multidimensional scaling) plot

par(mfrow=c(2,3))
plotMDS(counts, col = c(rep("red", 5), rep("blue", 5)), cex = 1.5)
title("MDS plot on raw data")
plotMDS(logcounts, col = c(rep("red", 5), rep("blue", 5)), cex = 1.5)
title("MDS plot on log transformed data")
plotMDS(rlogcounts, col = c(rep("red", 5), rep("blue", 5)), cex = 1.5)
title("MDS plot on rlog transformed data")
plotMDS(cpmcounts, col = c(rep("red", 5), rep("blue", 5)), cex = 1.5)
title("MDS plot on cpm transformed data")
plotMDS(vstcounts, col = c(rep("red", 5), rep("blue", 5)), cex = 1.5)
title("MDS plot on vst transformed data")

Draw correlation heatmap

require(gplots)
draw_corr_heatmap(as.matrix(counts), show_cellnote = TRUE, title = "clustering sample-to-sample\n distance on raw data")

draw_corr_heatmap(as.matrix(logcounts), show_cellnote = TRUE, title = "clustering sample-to-sample\n distance on log transformed data")

draw_corr_heatmap(rlogcounts, show_cellnote = TRUE, title = "clustering sample-to-sample\n distance on rlog transformed data")

draw_corr_heatmap(cpmcounts, show_cellnote = TRUE, title = "clustering sample-to-sample\n distance on cpm transformed data")

draw_corr_heatmap(vstcounts, show_cellnote = TRUE, title = "clustering sample-to-sample\n distance on vst transformed data")

Set conditions for DE analysis

Here I have to relevel the conditions because by default it considers the first factor to be the reference

conditions = factor(c(rep("Ufo",5), rep("Wt",5)))
conditions = relevel(conditions, ref="Wt")

Filter low-count reads

I’m using cpm normalization method, filter out non-expressed genes (0 counts), you can also try more stringent filtering (e.g. cpm > 1 in at least 5 samples)

keep <- rowSums(cpm(counts)>0) >= 5
table(keep)
## keep
## FALSE  TRUE 
##  3584 23106
keep_true = data.frame(keep[which(keep == TRUE)])
filter = subset(counts, rownames(counts) %in% rownames(keep_true))

DESeq

require(DESeq)
cds = newCountDataSet(filter, conditions)
cds = DESeq::estimateSizeFactors(cds)
cds = DESeq::estimateDispersions(cds)
DESeq_res = DESeq::nbinomTest(cds, "Wt", "Ufo")
rownames(DESeq_res) = DESeq_res$id
DESeq_DE = subset(DESeq_res, (log2FoldChange < -1 & padj < 0.05) | (log2FoldChange > 1 & padj < 0.05) )
DESeq_nc = counts(cds, normalized = TRUE)
DESeq_nc = data.frame("id" = rownames(DESeq_nc), DESeq_nc)
DESeq_nc_DE = subset(DESeq_nc, id %in% DESeq_DE$id)

DESeq2

require(DESeq2)
colData = data.frame(samples = colnames(filter), conditions = conditions)
dds = DESeqDataSetFromMatrix(countData = filter, colData = colData, design = ~conditions)
## converting counts to integer mode
dds = DESeq2::estimateSizeFactors(dds)
dds = DESeq2::estimateDispersions(dds)
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
dds = DESeq2::nbinomWaldTest(dds)

DESeq2_res = DESeq2::results(dds)
DESeq2_res = data.frame("id" = rownames(DESeq2_res), DESeq2_res)
DESeq2_DE = subset(DESeq2_res, (log2FoldChange < -1 & padj < 0.05) | (log2FoldChange > 1 & padj < 0.05) )
DESeq2_nc = counts(dds, normalized = TRUE)
DESeq2_nc = data.frame("id" = rownames(DESeq2_nc), DESeq2_nc)
DESeq2_nc_DE = subset(DESeq2_nc, id %in% DESeq2_DE$id)

edgeR

require(edgeR)
group = as.vector(conditions)
dge = DGEList(counts = filter, group = group)
dge = calcNormFactors(dge)
dis = estimateCommonDisp(dge)
tag = estimateTagwiseDisp(dis)
etx = exactTest(tag, pair = c("Wt", "Ufo"))
edgeR_res = etx$table
edgeR_res$FDR = p.adjust(edgeR_res$PValue, method = "BH")
edgeR_res = data.frame("id" = rownames(edgeR_res), edgeR_res)
edgeR_DE = subset(edgeR_res, (logFC < -1 & FDR < 0.05) | (logFC > 1 & FDR < 0.05) )
edgeR_nc = tag$pseudo.counts
edgeR_nc = data.frame("id"=rownames(edgeR_nc), edgeR_nc)
edgeR_nc_DE = subset(edgeR_nc, id %in% edgeR_DE$id)

limma

require(limma)
design = model.matrix(~conditions)
voom = voom(filter, design, normalize="quantile")
fit = lmFit(voom, design)
fit = eBayes(fit)
limma_res = topTable(fit, coef = NULL, n = Inf)
## Removing intercept from test coefficients
limma_res = data.frame("id" = rownames(limma_res), limma_res)
limma_DE = subset(limma_res, (logFC < -1 & adj.P.Val < 0.05) | (logFC > 1 & adj.P.Val < 0.05) )
limma_nc = 2**voom$E
limma_nc = data.frame("id" = rownames(limma_nc), limma_nc)
limma_nc_DE = subset(limma_nc, id %in% limma_DE$id)

Compare DESeq, DESeq2, edgeR and limma DEG results

dflist <- list(DESeq=DESeq_DE, DESeq2=DESeq2_DE, edgeR=edgeR_DE, limma=limma_DE)
Compare_method <- join_id(dflist)
## Loading required package: plyr
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:matrixStats':
## 
##     count
## The following object is masked from 'package:IRanges':
## 
##     desc
## The following object is masked from 'package:S4Vectors':
## 
##     rename
Compare_method$message
## [1] "There are 282 genes DE in all 4 methods 114 genes in 3, 245 genes in 2, 136 genes in 1."

Displaying table using DT library

require(DT)
Compare_summary <- merge(Compare_method$merged_table, annotations, by.x = "id", by.y = "MSTRG_gene_id", all.x = TRUE)
datatable(Compare_summary,
          rownames = FALSE,
          caption = htmltools::tags$caption(
            style = "caption-side: bottom; text-align: left;",
            "Table 1: Summary of DEGs found in four different methods."),
          options = list(
            searchHighlight = TRUE
          )
          )

Add log2FoldChange

Compare_summary <- merge(Compare_summary, DESeq_res[,c("id", "log2FoldChange","padj")], by = "id")
setnames(Compare_summary, old = c("log2FoldChange", "padj"), new = c("DESeq_logFC", "DESeq_padj"))
Compare_summary <- merge(Compare_summary, DESeq2_res[,c("id", "log2FoldChange","padj")], by = "id")
setnames(Compare_summary, old = c("log2FoldChange", "padj"), new = c("DESeq2_logFC", "DESeq2_padj"))
Compare_summary <- merge(Compare_summary, edgeR_res[,c("id", "logFC", "FDR")], by = "id")
setnames(Compare_summary, old = c("logFC", "FDR"), new = c("edgeR_logFC", "edgeR_padj"))
Compare_summary <- merge(Compare_summary, limma_res[,c("id", "logFC", "adj.P.Val")], by = "id")
setnames(Compare_summary, old = c("logFC", "adj.P.Val"), new = c("limma_logFC", "limma_padj"))
Compare_summary = Compare_summary[,c(1,9,10,11,12,13,14,15,16,6,7,8)]
Compare_summary = aggregate(Compare_summary, by = list(Compare_summary$id), FUN = aggregate_func)
Compare_summary$id <- NULL
setnames(Compare_summary, old = "Group.1", new = "id")
datatable(Compare_summary, filter = "top",
          rownames = FALSE,
          caption = htmltools::tags$caption(
            style = "caption-side: bottom; text-align: left;",
            "Table 2: Summary of logFC values of DEGs."),
          options = list(
            autoWidth = TRUE,
            searchHighlight = TRUE,
            scrollX = TRUE
            )
          )

Draw Venn Diagram

require(VennDiagram)
draw_venndiagram(dflist, Compare_method$merged_table)

Draw heatmap for DEGs in separate plots

Cluster for Up & Down DEGs, respectively

require(ComplexHeatmap)
Ht1 = DE_heatmap(DESeq_nc_DE, title = "DESeq", km = 2)
Ht2 = DE_heatmap(DESeq2_nc_DE, title = "DESeq2", km = 2)
Ht3 = DE_heatmap(edgeR_nc_DE, title = "edgeR", km = 2)
Ht4 = DE_heatmap(limma_nc_DE, title = "limma", km = 2)
Ht1

Ht2

Ht3

Ht4

Draw heatmaps side-by-side on common DEGs

Ht1 = DE_heatmap(DESeq_nc_DE, common_id=Compare_method$common_id, title = "DESeq", km = 2)
Ht2 = DE_heatmap(DESeq2_nc_DE,common_id=Compare_method$common_id, title = "DESeq2",km = 2)
Ht3 = DE_heatmap(edgeR_nc_DE, common_id=Compare_method$common_id, title = "edgeR", km = 2)
Ht4 = DE_heatmap(limma_nc_DE, common_id=Compare_method$common_id, title = "limma", km = 2)

Ht1 + Ht2 + Ht3 + Ht4

Draw MA plot

par(mfrow=c(2,2), mar=c(5.1, 4.6, 4.1, 1.6))
draw_MA(DESeq_res, type="DESeq")
draw_MA(DESeq2_res, type="DESeq2")
draw_MA(edgeR_res, type="edgeR")
draw_MA(limma_res, type="limma")

Draw Volcano plot

par(mfrow=c(2,2), mar=c(5.1, 4.6, 4.1, 1.6))
draw_volcano(DESeq_res, type="DESeq")
draw_volcano(DESeq2_res, type="DESeq2")
draw_volcano(edgeR_res, type="edgeR")
draw_volcano(limma_res, type="limma")

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] plyr_1.8.4                  VennDiagram_1.6.20         
##  [3] futile.logger_1.4.3         ComplexHeatmap_2.0.0       
##  [5] e1071_1.7-2                 data.table_1.12.4          
##  [7] DESeq2_1.24.0               SummarizedExperiment_1.14.1
##  [9] DelayedArray_0.10.0         BiocParallel_1.18.1        
## [11] matrixStats_0.55.0          GenomicRanges_1.36.1       
## [13] GenomeInfoDb_1.20.0         IRanges_2.18.3             
## [15] S4Vectors_0.22.1            DESeq_1.36.0               
## [17] lattice_0.20-38             locfit_1.5-9.1             
## [19] Biobase_2.44.0              BiocGenerics_0.30.0        
## [21] cowplot_1.0.0               DT_0.9                     
## [23] reshape2_1.4.3              ggplot2_3.2.1              
## [25] gplots_3.0.1.1              RColorBrewer_1.1-2         
## [27] edgeR_3.26.8                limma_3.40.6               
## [29] pacman_0.5.1               
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.4-1       rjson_0.2.20           class_7.3-15          
##  [4] circlize_0.4.8         htmlTable_1.13.2       XVector_0.24.0        
##  [7] GlobalOptions_0.1.1    base64enc_0.1-3        clue_0.3-57           
## [10] rstudioapi_0.10        bit64_0.9-7            AnnotationDbi_1.46.1  
## [13] splines_3.6.1          geneplotter_1.62.0     knitr_1.25            
## [16] zeallot_0.1.0          jsonlite_1.6           Formula_1.2-3         
## [19] annotate_1.62.0        cluster_2.1.0          png_0.1-7             
## [22] shiny_1.4.0            compiler_3.6.1         backports_1.1.5       
## [25] fastmap_1.0.1          assertthat_0.2.1       Matrix_1.2-17         
## [28] lazyeval_0.2.2         later_1.0.0            formatR_1.7           
## [31] acepack_1.4.1          htmltools_0.4.0        tools_3.6.1           
## [34] gtable_0.3.0           glue_1.3.1             GenomeInfoDbData_1.2.1
## [37] dplyr_0.8.3            Rcpp_1.0.2             vctrs_0.2.0           
## [40] gdata_2.18.0           crosstalk_1.0.0        xfun_0.10             
## [43] stringr_1.4.0          mime_0.7               gtools_3.8.1          
## [46] XML_3.98-1.20          zlibbioc_1.30.0        scales_1.0.0          
## [49] promises_1.1.0         lambda.r_1.2.4         yaml_2.2.0            
## [52] curl_4.2               memoise_1.1.0          gridExtra_2.3         
## [55] rpart_4.1-15           latticeExtra_0.6-28    stringi_1.4.3         
## [58] RSQLite_2.1.2          genefilter_1.66.0      checkmate_1.9.4       
## [61] caTools_1.17.1.2       shape_1.4.4            rlang_0.4.0           
## [64] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14         
## [67] purrr_0.3.2            labeling_0.3           htmlwidgets_1.5.1     
## [70] bit_1.1-14             tidyselect_0.2.5       magrittr_1.5          
## [73] R6_2.4.0               Hmisc_4.2-0            DBI_1.0.0             
## [76] pillar_1.4.2           foreign_0.8-72         withr_2.1.2           
## [79] survival_2.44-1.1      RCurl_1.95-4.12        nnet_7.3-12           
## [82] tibble_2.1.3           crayon_1.3.4           futile.options_1.0.1  
## [85] KernSmooth_2.23-15     rmarkdown_1.16         GetoptLong_0.1.7      
## [88] blob_1.2.0             digest_0.6.21          xtable_1.8-4          
## [91] httpuv_1.5.2           munsell_0.5.0